IN THE UNITED STATES PATENT AND TRADEMARK OFFICE

In the matter of: 



Appl. No. : 09/934,156 Confirmation No. 7387

Applicant : David Roth Rigney 

Filed : August 21, 2001 

TC/A.U. : 1631 

Examiner : Cheyne D. Ly 

Response to : Office Action with mailing date of January 08, 2004 



DECLARATION OF DAVID R. RIGNEY
INCLUDING ATTACHMENTS 1, 2, AND 3

I, David R. Rigney, declare as follows: 

1. I am the inventor named in the patent application referenced above. I make this
Declaration based on my personal knowledge, and could and would testify competently to the 
facts stated herein. 

2. My professional background has included graduate training in bioengineering, a doctorate 
in physics with a specialization in biophysics, postdoctoral training in biophysics, professional 
appointment with the title of biophysicist, appointment as Assistant Professor at Harvard 
University in the School of Medicine with a joint appointment at the Massachusetts Institute of 
Technology in the Division of Health Sciences and Technology, appointment on the staff of 
Boston Beth Israel Hospital, and Vice-President for Research and Development for the company 
GENETWORKS Inc. 

3. While on the faculty of Harvard Medical School, I was the chairman of a faculty 
committee that was responsible for overseeing a resource that provided computer hardware and 
software support to molecular biologists who were on the faculty of Harvard Medical School and 
who had laboratories situated at Boston Beth Israel Hospital. In addition to acting in that
supervisory capacity, as a professional courtesy, I also personally provided hardware and software
support to molecular biologist colleagues who worked in laboratories at Boston Beth Israel 
Hospital. I also ran my own cell and molecular biology laboratory in which I provided my own 
hardware and software support. 

4. In connection with the duties described in paragraph 3 above, I joined a group called the 
Boston Area Molecular Biology Computer Types (BAMBCT). A description of BAMBCT is
provided in Attachment 1, which is a copy of a web page "Welcome to BAMBCT" that I 
downloaded at http://genetics.mgh.harvard.edu/l^ on June 18, 2004. 

5. I believe that the members of BAMBCT (or its equivalent in areas outside of Boston) 
collectively constitute representative artisans for the art that is used in the patent application 
referenced above (self-described "molecular biology computer types" who provide hardware and 
software support to university molecular biology departments or BioTech companies).

6. Although I last had contact with BAMBCT in 1997 or 1998, I have no reason to believe
that turnover of the members of the group is such that the range of backgrounds of the members 
of the group was any different at the time of the instant invention than it was in 1997 or 1998. If 
specification of a group of artisans is required for the period between 1997 or 1998 and the time 
of the instant invention, I would specify similarly self-described artisans in Austin, Texas, over that
time period, with whom I am personally familiar, but who apparently do not meet and confer as
an organized group. The current description of BAMBCT shown in Attachment 1 is identical to 
my understanding of what BAMBCT was at all times that I was one of its members. 

7. The BAMBCT mission statement shown in Attachment 1 refers to a most frequent 
common denominator among members of BAMBCT, which is that most people run "GCG". This 
reference is to a software package known as the GCG Wisconsin Package. The GCG Wisconsin
Package is described in Attachment 2, which is a copy of a web page "GCG Wisconsin Package"
that I downloaded at the web site http://www.accelrys.conVproducts^ on 
June 19, 2004. The features described in Attachment 2 are substantially the same as those 
described in 1998 in B.A. Butler, "Sequence Analysis Using GCG", Chapter 4 (pp. 74-97) in A.D.
Baxevanis and B.F.F. Ouellette, eds., Bioinformatics. A Practical Guide to the Analysis of Genes 
and Proteins . New York: Wiley Interscience, 1998, except that the current user interface is more 
sophisticated. The 1998 chapter by Butler is provided in Attachment 3 to this declaration.

8. Although I have never investigated in detail the backgrounds of the members of
BAMBCT, my general impression is that about half of the group have doctorates in a technical
subject and the other half have bachelor's or master's degrees in a technical subject that requires
some practical knowledge of computer hardware and software (computer science, physics, 
chemistry, engineering). The members of BAMBCT would ordinarily be able to write simple 
computer programs (scripts) for pre-existing software (like the GCG Wisconsin package) but 
would not in general be engaged in the writing of whole, compiled computer programs that 
require the design and implementation of a new algorithm. The members of BAMBCT would 
ordinarily be able to install off-the-shelf commercial software products, but would not necessarily 
be able to install non-commercial software without the debugging assistance of a non-biological 
computer systems analyst/programmer within their organization. The members of BAMBCT 
would ordinarily be experienced with sequence analysis as implemented, for example, in the GCG 
Wisconsin Package. BAMBCT members would not in general be technically experienced with the 
methods of gene expression analysis (Northern blots, microarrays, RT-PCR, etc.), although 
members would have some familiarity with the basics of such methods. I do not recall there ever 
being any mention, at a BAMBCT meeting or in BAMBCT email or later by any of the artisans in
Austin, Texas, about natural language processing, as described, for example, by Manning and
Schutze (1999).

9. I believe that if "the artisan of ordinary skill in the art at the time of the instant invention" 
is taken to be a randomly selected member of BAMBCT or its equivalent in areas outside of 
Boston, then it would not have been obvious to that artisan to combine Andrade et al (1999) with 
McCallum (1998) to make the instant invention. This is primarily because that artisan would not
be expected to have any prior training or experience in the technical aspects of McCallum (1998) 
such as the Naive Bayes concept; because that artisan would not necessarily be able to write a 
whole, compiled computer program that requires the design and implementation of a new 
algorithm; and because that artisan would not be able to conceive the convoluted sequence of 
changes needed to transform the combined Andrade et al and McCallum references, taken as a 
whole (including references cited by Andrade et al), into the instant invention, taken as a whole, 
without inadmissible hindsight. 

10. The disclosure by McCallum states on its page 1, lines 6-7, that "Several of the 
examples also assume that you have downloaded the 20 newsgroups data set, unpacked them in 
your home directory, and therefore that its files are available in the directory ~/20_newsgroups." I 
performed this step as indicated above. I have also counted the words in each of the 20 groups, 
using the djgpp (unix) utility "wc" (word count). The number of words in the text corpus
corresponding to each of the sample classes is as follows, as an indication of the size of the text 
corpus with which the program Rainbow is expected to work. I believe that this number of words 
is several orders of magnitude larger than the number of words to be found in the annotations or 
dictionary described in Andrade et al, so that there would not be a reasonable expectation of the 
useful or successful application of the Rainbow software described by McCallum to the 
annotations or dictionary of Andrade et al.

Class Name                   Number of Words

alt.atheism                  354053
comp.graphics                278585
comp.os.ms-windows.misc      234915
comp.sys.ibm.pc.hardware     216999
comp.sys.mac.hardware        203473
comp.windows.x               305914
misc.forsale                 164281
rec.autos                    237731
rec.motorcycles              217896
rec.sport.baseball           249160
rec.sport.hockey             301787
sci.crypt                    348884
sci.electronics              225887
sci.med                      313044
sci.space                    310385
soc.religion.christian       404170
talk.politics.guns           356830
talk.politics.mideast        523816
talk.politics.misc           436764
talk.religion.misc           362082
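
For readers who wish to reproduce a per-class word count of the kind tabulated above, the following is a minimal sketch only, not the procedure actually used for this declaration (which was the "wc" utility). It assumes the corpus is laid out as one subdirectory per class, as in the ~/20_newsgroups directory quoted from McCallum, and it counts whitespace-delimited tokens, which approximates what "wc" reports as words. The function names are hypothetical and chosen for illustration.

```python
from pathlib import Path


def count_words_in_file(path):
    """Count whitespace-delimited tokens in one file (approximates `wc -w`)."""
    # Read raw bytes so that stray non-UTF-8 characters in old Usenet posts
    # cannot raise a decoding error; bytes.split() splits on ASCII whitespace.
    return len(Path(path).read_bytes().split())


def count_words_per_class(corpus_root):
    """Return {class_name: total_words} for each subdirectory of corpus_root.

    Assumption: every immediate subdirectory is one class (newsgroup) and
    every regular file beneath it is one document of that class.
    """
    root = Path(corpus_root).expanduser()
    totals = {}
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        totals[class_dir.name] = sum(
            count_words_in_file(f) for f in class_dir.rglob("*") if f.is_file()
        )
    return totals
```

Calling count_words_per_class("~/20_newsgroups") would return one total per class directory. The totals need not match the table exactly, since they depend on the corpus version and on how the counting tool treats whitespace.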



The documents provided as Attachments 1-3 of this declaration are true and exact copies 
of what they are intended to represent. I declare under penalty of perjury under the laws of the 
United States of America that all the foregoing is true and correct. 

Signed in Austin, Texas on July 7, 2004: 



David R. Rigney 



CERTIFICATE OF EXPRESS MAIL UNDER 37 C.F.R. 1.10

I hereby certify that this paper is being deposited with the United States Postal Service "Express Mail Post Office to 
Addressee" service on the date indicated below and is addressed to Commissioner for Patents, P.O. Box 1450, 
Alexandria, VA 22313-1450.



Printed Name: David R. Rigney 

Signature: [handwritten signature]



Date of Deposit: July 7, 2004 

Express Mail Label No. ER 826633934 US

bambct-mission.html

Welcome to BAMBCT (Boston Area Molecular Biology Computer Types).

We are both a real Bioinformatics mutual support group (meetings monthly at the best local
mini-breweries for beer-drinking and discussions of hardware and software) as well as a virtual 
community of about 40 members communicating via email. On the average about 10 people attend 
monthly meetings at the local pubs. Most of us provide hardware and software support to university 
molecular biology depts or BioTech companies. The group started out with most people running GCG, 
but some now run DNA* and other programs or specialize in sub-areas of 
molecular biology computing. Some of us run DNA sequencing or synthesis facilities etc. 

Access to the virtual community is via sending email to 

bambct-list@molbio.mgh.harvard.edu 

to directly get to all members on our mail exploder list. 

You should post almost anything reasonable: job openings, requests for help, tips on new
programs or Web sites, etc. 

On occasion we have seminars or show-and-tell meetings to discuss issues
many of us are interested in. We have had presentations on uses of
Webservers in our Departments, large-scale DNA sequencing contig
assembly, new GCG programs (presented by GCG), open source bioinformatics software.

Your suggestions are welcome. 

Lance 

GCG Wisconsin Package 


SeqLab, free with the Wisconsin 
Package, provides a graphical 
interface to the Package's 
analysis tools plus project 
management capabilities. 
SeqLab's Editor (shown above)
enables you to enter sequences, 
view multiple sequence 
alignments, as well as select the 
sequence ranges to analyze. 



Molecular biologists worldwide use the GCG® Wisconsin Package® as 
their software of choice for comprehensive sequence analysis. The 
Wisconsin Package meets research needs across disciplines, project 
teams, and labs to provide an enterprise-wide solution. Based on 
published algorithms from the fields of mathematical and 
computational biology, the Package includes tools for: 

• Comparison 

• Database Searching and Retrieval 

• DNA/RNA Secondary Structure 

• Editing and Publication 

• Evolution 

• Fragment Assembly 

• Gene Finding and Pattern Recognition 

• Importing and Exporting 

• Mapping 

• Primer Selection

• Protein Analysis

• Translation 

In addition to running within the UNIX operating system, the 
Wisconsin Package also runs on Intel-based x86 personal computers 
with Red Hat Linux 7.1 or 7.2. With the exception of the PAUP family 
of programs, the Linux edition of the Package provides the same 
functionality as the UNIX edition.

The Wisconsin Package is licensed on a per-server basis. Contact your
sales representative for more information.

A relational database version of the Wisconsin Package is available for
use with DS SeqStore, our Oracle®-based data management and
mining system. In addition to its sequence analysis capabilities, DS
SeqStore includes tools to establish in-house relational databases;
receive automated sequence data updates; and set up automated
sequence analysis pipelines.

If your researchers require access to up-to-date public data behind the
security of your institution's firewall, consider subscribing to one of
our data update services. These services provide daily or bimonthly
delivery of publicly-available sequence data already formatted for use
with the Wisconsin Package.

Software Review 

HMS Beagle— online magazine for the BioMedNet organization. 
Please follow this link (free registration required). 

Interfaces 

Three interfaces are available for the Wisconsin Package: SeqLab® 
and the command-line interfaces come with the Package while 
SeqWeb® is licensed separately. 

SeqLab

Supplied with the Package, SeqLab supplies a
graphical user interface to the Wisconsin Package.
SeqLab requires an X Windows display, such as an X
server running on a PC or Macintosh, an X terminal,
or a workstation that runs X Windows.

SeqLab provides an interactive sequence editor, 
convenient project management capabilities as well as 
a friendly interface for using Wisconsin Package 
programs. SeqLab supplies a rich visual display of 
sequences by individual bases or residues or by 
known sequence features that makes it easier to edit 
sequences or create and manipulate sequence 
alignments. In addition, you can click and drag to
highlight multiple sequences or regions within 
sequences upon which to perform some analysis. 
SeqLab also makes it easy for you to annotate 
sequences due to your analysis or comparison with 
other sequences and their features. And if you have 
other programs that meet your needs, you can 
integrate them into SeqLab for ease of use and to 
create a common interface among them. 

SeqLab's pull-down menus let you choose programs 
to manipulate the sequence(s) you have chosen. 
When you select a program from a pull-down menu, a 
separate window specific for that program appears. 
The program window includes a short message 
describing the program and presents all necessary 
input.

Command Line

The command-line interface enables you to run
programs from the UNIX system prompt. All
Wisconsin Package programs run in a similar manner;
if you know how to run one, you know how to run
them all.

Typical Scenario. Each program requires specific 
information to run successfully. When you run a 
program, it prompts you for an input file or asks you
to answer questions with a yes or no or fill in a
number or letter from a menu of available choices. 
The program suggests an answer for each prompt, 
except for the input file, allowing you to simply press 
<Enter> without typing a response. 

Most programs have fewer than six prompts. Many 
program features are available as optional program 
modifiers. This design allows you to concentrate on 
the analysis that interests you without having to sift 
through many program modifiers each time you run 
the program. 

SeqWeb

SeqWeb, an add-on product to the Wisconsin
Package, allows you to connect to popular Wisconsin
Package programs via Netscape® or Internet
Explorer®. With SeqWeb, you can directly import files
in a variety of formats; choose from a list of critical
parameters for each program; link to databases on
your local intranet or on the Internet from within
program output; and run multi-program analyses in
just one step. For more information, see SeqWeb.

Advantages 

■ Well Established 

Researchers worldwide use the Wisconsin Package and collaborate 
with Accelrys to provide programs that meet your needs. 

■ Breadth of Analysis 

The Wisconsin Package is the most comprehensive sequence analysis 
software available. Instead of using multiple software tools to achieve 
a final result, output from one Package program often acts as an input 
to others, providing a flow of analyses within a single interface. 

■ Enterprise-Wide

Multi-user environment allows an unlimited number of scientists within 
your organization to share software and data. 

■ Expertise 

Our bioinformatics support staff is both highly rated and trained to 
provide scientific and technical expertise. 

■ Current Data 

We offer up-to-date access to the nucleic acid sequence databases 
GenBank and GenEMBL and the protein sequence databases PIR, 
SP-TrEMBL, SWISS-PROT, GenPept, NRL_3D, and Pfam. 

■ Extendable Framework 

Extensions enable you to plug other in-house or third-party software 
into SeqLab to provide a common interface and easy access. 

■ Legacy of Commitment

Since the beginning of bioinformatics, we have supported molecular
biologists' research needs through software, education, and support,
and we continue to incorporate new technologies as this research
discipline develops.

Tour of Highlighted Programs 

The Wisconsin Package programs cover a wide range of scientific
interests, including sequence entry and fragment assembly, mapping,
database searching, multiple sequence and evolutionary analysis,
pairwise comparison, gene finding, DNA/RNA and protein secondary
structure, translation, and display. Following are select Wisconsin
Package programs grouped by function. Images are from the
command-line interface.



Sequence Entry

• SeqEd is an interactive editor for entering and modifying
sequences.

• Individual sequences in Staden, EMBL, GenBank®, PIR®,
IntelliGenetics, and FastA formats can be changed to Wisconsin
Package format with the programs FromStaden, FromEMBL,
FromGenBank, FromPIR, FromIG, and FromFastA. Sequences in other
formats can be entered using the Reformat program.

SeqEd enables you to view and change sequences.



Mapping

• Prime selects oligonucleotide primers for a template DNA
sequence based on primer melting temperatures (Borer et al.),
thermodynamic parameters for DNA (Breslauer et al.), PCR product
melting temperatures (Baldino et al.), annealing temperatures of
PCR primer pairs (Rychlik et al.),
and self- or pair-annealing testing (Hillier and Green).

• Map displays enzyme restriction sites above both strands of DNA
along with protein translations below the DNA (Schroeder and
Blattner).

• MapPlot displays restriction sites graphically.

• MapSort lists, by size, the fragments of single or multiple
restriction enzyme digests.

Prime selects oligonucleotide primers for a template DNA sequence.

Map displays enzyme restriction sites as text.

MapPlot displays enzyme restriction sites graphically.


Fragment Assembly 

A set of programs based on 
the methods of Staden let 
you enter, assemble, and 
view overlapping nucleotide 
fragments to create a single 
continuous sequence. 

• GelMerge uses the 
method of Wilbur and 
Lipman to find 
overlapping regions 
among the fragments 
and the method of 
Needleman and 
Wunsch to align the 
fragments. A key 
option allows you to 
excise vector 
sequences. 

• GelAssemble is a 
multiple sequence 
editor for viewing and 
editing contigs, or 
aligned assemblies of 
sequence fragments, 
assembled by 
GelMerge. 

• GelView displays a 
schematic view of all 
contigs and their 
fragments in a 
project. 

• GelDisassemble 
breaks up all contigs 
into their original 
fragments. 




GelAssemble enables you 
to view and modify contigs. 




GelView displays contigs
graphically.



Database Searching

• BLAST is a fast, 
statistically driven 
sequence searching 
and alignment tool 
(Altschul et al.) that 
can search databases 
on your computer or 
those maintained at 
the National Center 
for Biotechnology 
Information (NCBI) or 
at other regional 
BLAST servers. 

• PSIBLAST uses
position-specific
scoring matrices
(PSSMs) to score
matches between
query and database
sequences, in
contrast to BLAST,
which uses
pre-defined scoring
matrices such as
BLOSUM62. PSIBLAST
may be more
sensitive than BLAST,
meaning that it may
find distantly related
sequences not found
with a BLAST search.




BLAST searches for 
sequences similar to a 
query sequence. 




Motifs searches protein 
sequences for defined 
patterns. 



• FastA provides a 
more sensitive 
sequence searching 
and alignment tool 
(Pearson and 
Lipman). Variations of 
FastA include 
SSearch, TFastA, 
TFastX, and FastX. 

• FindPatterns locates 
short ambiguous 
sequences like 
transcription factors 
in a database or set 
of sequences. 

• Motifs searches sets 
of protein sequences 
or protein databases 
for the patterns 
defined in PROSITE 
(Bairoch). 

• StringSearch searches 
through the sequence 
database references 
to locate sequences of 
interest, for example 
all human globin 
sequences.

• DataSet creates
personal sequence
databases for use by
the database
searching programs.



Pairwise Comparison 

• DotPlot graphically 
displays an alignment 
between two 
sequences based on a 
number of matches 
within a given range 
(Maizel and Lenk) or 
complete matching 
over an entire short 
range (Wilbur and 
Lipman). 

• BestFit finds the best 
segment of similarity 
between two 
sequences (Smith and 
Waterman). 

• Gap finds the 
complete alignment of 
two sequences 
(Needleman and 
Wunsch). 




DotPlot graphically displays 
the alignment between two 
sequences. 



Multiple Sequence 
Analysis 

• PileUp creates a 
multiple sequence 
alignment of up to 
500 sequences using 
the method of Feng 
and Doolittle, similar 
to the method of 
Higgins and Sharp. A 
dendrogram 
illustrating sequence 
similarity is also 
created using the 
strategy of Sneath 
and Sokal. 

• ProfileMake creates a
quantitative
representation (a
profile) of a family of
aligned sequences
that gives extra
weight to parts of the
alignment that are
conserved across the
family (Gribskov et
al., 1987). The profile
can be used by
ProfileSearch to
search databases to
find other members
of the family
(Gribskov et al.,
1990).



PileUp creates a multiple
sequence alignment for up
to 500 sequences.



PileUp can also create a
dendrogram to graphically
show sequence similarity.

• LineUp is an 
interactive editor for 
editing multiple 
sequence alignments. 

• Pretty displays 
multiple sequence 
alignments. 

• MEME (Multiple EM for 
Motif Elicitation - 
Timothy Bailey and 
Charles Elkan, 
University of 
California, San Diego) 
finds conserved 
motifs in a group of 
unaligned sequences 
and saves these 
motifs as a set of 
profiles. You can 
search a database of 
sequences with these 
profiles using the 
MotifSearch program. 

• NoOverlap identifies 
the places where a 
group of nucleotide 
sequences do not 
share any common 
subsequences. 



Evolutionary Analysis 

• PAUPSearch provides 
a Wisconsin Package 
interface to the 
tree-searching 
options in PAUP 
(Phylogenetic Analysis 
Using Parsimony). 

• PAUPDisplay provides 
a Wisconsin Package 
interface to tree 
manipulation, 
diagnosis, and display 
options in PAUP. 

• Distances writes a
matrix of the pairwise
evolutionary distances
between aligned
sequences. To correct
for multiple
substitutions several
methods may be
chosen: for nucleic
acid sequences,
Kimura's
two-parameter
method (1980), the
Tajima and Nei
method, and the Jin
and Nei method; for
protein sequences,
the Kimura method
(1983); and for either
type of sequence the
Jukes and Cantor
method.

• GrowTree creates a
phylogenetic tree
using
neighbor-joining
(Saitou and Nei) or
UPGMA (Sneath and
Sokal).



GrowTree creates a
phylogenetic tree for a
group of aligned
sequences.



Gene Finding 

• CodonPreference 
identifies and displays 
possible protein 
coding regions based 
on similarity of the 
codon usage in the 
sequence to a codon 
frequency table 
(Gribskov et al., 
1984). Third position
bias in the codon can
also be displayed.

• TestCode uses the
statistical method of
Fickett based on the
period three 
compositional 
constraints in the 
entire nucleic acid 
database to identify 
and display protein 
coding regions. 

• Frames displays open 
reading frames for 
the six DNA 
translation frames. 




CodonPreference finds 
possible protein coding 
regions. 



TestCode plots a measure
of the non-randomness of
the composition at every
third base.



Frames displays open
reading frames for the six
DNA translation frames.



DNA/RNA Secondary 
Structure 

• MFold is an 
adaptation of the 
"mfold" package of 
Zuker and Jaeger 
which predicts 
optimal and 
suboptimal secondary 
structures for an RNA 
or DNA molecule 
(Zuker, and Jaeger et 
al.). 

• PlotFold provides six 
ways to graphically 
display the optimal 
and suboptimal 
secondary structures 
calculated by MFold:
energy dotplot,
p-num, circles, dome,
mountain, and
squiggle plots.




PlotFold's energy dotplot is 
a two-dimensional graph 
where both axes represent 
the same RNA sequence 
and each point on the 
graph indicates a base pair 
between the 
ribonucleotides whose 
positions in the sequence 
are the coordinates of that 
point on the graph. 




PlotFold's circles plot is a 
circular Nussinov graph of a 
nucleic secondary structure 
that shows the sequence as 
a segment of the circle. 



PlotFold's squiggles plot
represents the bonds
formed between bases as
chords.



Translation

• Nucleotide sequences 
are translated into 
peptides using the 
Translate program. 

• Peptide sequences 
are backtranslated 
into nucleotide 
sequences with the 
BackTranslate 
program. 



Translate translates 
nucleotide sequences into 
peptides. 



Display 



• Publish arranges 
sequence data for 
publication. 

• PlasmidMap displays 
a circular plot of a 
plasmid construct. 




Publish produces 
publication-ready output. 




PlasmidMap creates a 
circular map of a plasmid 
construct. 



Protein Analysis 
• PepPlot plots all of
the standard
measures of protein
secondary structure:
alpha-helix and
beta-sheet prediction
(Chou and Fasman,
and Garnier et al.),
hydrophobic moment
(Eisenberg et al.),
and hydropathy (Kyte
and Doolittle, and
Engelman et al.).

• PlotStructure can
display several
standard measures of
secondary structure
as well as antigenic
index (Wolf et al.)
and surface
probability (Emini et
al.).

• HelicalWheel arranges
the residues of a
protein into a helix of
adjustable angle and
identifies hydrophobic
residues.

• Isoelectric plots the
charge of a peptide
as a function of pH
and calculates the
isoelectric point.

• The Moment program
helps find amphiphilic
regions that coincide
with an alpha-helix or
beta-sheet structures
(Eisenberg et al.).

• TransMem builds on
the method of
Sonnhammer et al. to
predict likely
transmembrane
helices in one or more
input proteins. The
method is based upon
a Hidden Markov
Model (HMM) that has
been trained on a set
of membrane proteins
with helical
membrane spanning
regions.



PepPlot plots all of the
standard measures of
protein secondary
structure.



PlotStructure displays 
several standard measures 
of secondary structure, 
antigenic index, and 
surface probability. 




HelicalWheel arranges the 
residues of a protein into a 
helix and identifies 
hydrophobic residues. 



Requirements 

Recommended Platforms for New Installations or Upgrades 

Version 10.3 of the Wisconsin Package can be installed on a UNIX host 
system with users connecting to the host system via a modem or 
direct connection. The Package can also be installed on a personal 
computer running Red Hat Linux 7.1 or 7.2. 



Computer                             Operating System

Compaq                               Tru64 UNIX 4.0E or later (Digital Unix) and 5.0
Silicon Graphics (RISC-Based)        IRIX 6.5
Sun (SPARC-Based)                    Solaris 2.6, 7, or 8
IBM                                  AIX 5.1 and above
Intel x86 based Personal Computer    Red Hat Linux 7.1 or 7.2



Memory and Storage 

[...] Wisconsin Package programs. Therefore, double your storage
requirements to accommodate the data in both formats.

[...]

Software CD Installation Set Sizes in MB

Set      Tru64 UNIX       IRIX 64   IRIX 32   Solaris   AIX   Linux
         (Digital UNIX)

Binary   43               49        44        76        48    35
Base     41               41        41        41        41    41
Total    84               90        85        117       89    76



Data Installation DVD

Installation Set Sizes in MB (March 2002)

GenBank®                          10,819
SWISS-PROT®                          353
EMBL (Abridged)                      217
SP-TrEMBL                          1,012
NRL_3D Protein Structure              45
PIR®                                 461
GenBank Tags (EST and GSS)*       43,715
EMBL Tags (Abridged)*                 36
BLAST                              3,693
Lookup Indices                     1,203
BLAST Tags*                        4,422
Lookup Indices for Tags*           9,547
GenPept                              617
Pfam                                 302

DataBasic Total: 18,722 MB
DataExtended Total: 76,442 MB

* Additional data supplied with the DataExtended service.
X Windows Software (Optional) 

[...] Wisconsin Package graphics output, you need an X Windows server [...]

The User Workstation

You need either a modem or direct connection to the host system and need a
terminal that preferably is able to [...] graphics. One or more
locally available printers [...] graphics are also
required. Plotters [...] output if the terminal
or printer does not have graphics [...]

[...] 20% [...]

[...] to use SeqLab, the [...]

Terminals

A terminal can be provided in three different ways:

[...] VersaTerm Pro. Both provide Tek[...]

[...] the terminal should be able to display Tektronix [...] or X Window (X
terminal) graphics [...]

SeqLab [...] host system have X Window Manager
software [...]

Supported Graphics Terminals 

X terminals 

X Windows (for X servers) 

Tektronix 4014 
Tektronix 4105 
Tektronix 4107 
Tektronix 4207 

VT330 (ReGIS)
VT340 (ReGIS)
Macintosh running VersaTerm Pro
running SmarTerm 340 

Terminal Emulator Resource Addresses 

SmarTerm 

Esker, Inc.

465 Science Drive

Madison, Wisconsin 53711 USA

Tel: (608) 273-6000

Fax: (608) 273-8227

Web: flLtfijVVjvym^e^com 

VersaTerm Pro 

Synergy Software 
2457 Perkiomen Avenue 
Reading, Pennsylvania 19606 USA
Tel: (610) 779-0522 
Fax: (610) 370-0548 
Web: ww^syj^^^ 

Microcomputer X Server Recommendations 
Printers and Plotters 

Printers should be connected to the host computer and placed in locations near users. Both
PostScript and Hewlett Packard LaserJet printers are supported for this purpose. The
Wisconsin Package prints ASCII files on PostScript laser printers.

Most programs write output that can be printed on any standard ASCII printer, and many
terminals have a port which can be used to connect a printer or plotter. If a terminal
emulator will be used, extra setup may be needed to direct output from the host computer
to a printer or plotter attached to the PC.

The system manager defines the available local printing and plotting devices so that they
are easy for users to access.

Supported Printers and Plotters 

Apple LaserWriter 

DEC Printserver PS20 
DEC LA210 

HP LaserJet III
HP LaserJet IV 
HP7475 plotter 
HP7550 plotter 






IN THE UNITED STATES PATENT AND TRADEMARK OFFICE 




In the matter of:



Appl. No. : 09/934,156 Confirmation No. 7387

Applicant : David Roth Rigney 

Filed : August 21, 2001 

TC/A.U. : 1631 

Examiner : Cheyne D. Ly 

Response to : Office Action with mailing date of January 08, 2004 



ATTACHMENT 3 TO THE 
DECLARATION OF DAVID R. RIGNEY



Title of Attachment: Book Chapter Entitled "Sequence Analysis Using GCG" 
Number of Pages: 24 

Description of Attachment: B.A. Butler, "Sequence Analysis Using GCG", Chapter 4 (pp. 74-97)
in A.D. Baxevanis and B.F.F. Ouellette, eds., Bioinformatics: A Practical Guide to the
Analysis of Genes and Proteins. New York: Wiley Interscience, 1998



Sequence Analysis Using GCG



Barbara A. Butler 

Genetics Computer Group, Inc. 
Oxford Molecular Group 
Madison, Wisconsin 



INTRODUCTION 



The advent of rapid, economical nucleic acid sequencing methods revolutionized many sci- 
entific disciplines including molecular biology, genetics, and biochemistry (Gilbert, 1981; 
Sanger, 1981). This technology also established a need for public databases to house the
enormous amount of sequence information that was soon being generated in laboratories 
worldwide (Benson et al., 1997; Stoesser et al., 1997). The fields of bioinformatics and com- 
putational biology came of age with the establishment of these databases, since sequences 
submitted to them required analysis and annotation. In addition, existing database entries 
needed to be identified and retrieved by researchers wishing to study them further.

Bioinformatics can be described as the acquisition, analysis, and storage of biological 
information, specifically nucleic acid and protein sequences. Computational biology is the 
development of algorithms and computer programs integral to these endeavors. Both fields
have grown dramatically in the past decade, driven by the enormous amount of data accu- 
mulating from whole-genome sequencing projects. Programs for analyzing sequences and 
searching databases are available from a number of sources, both commercial and academic.
Packages for personal computers and Macintoshes are often expensive, especially
for multiple users, and can lack a comprehensive array of programs for analysis and
editing. Publicly available stand-alone (i.e., not part of a package) programs are inexpensive,
in contrast to commercial programs, but they have to be downloaded and sometimes com- 
piled on the local machine, and users have to become familiar with the format for input 
sequences and learn how to run them effectively. Network access to selected programs has
become available recently, but it is difficult to perform analyses requiring more than one of
these programs. For example, depending on the software used, a researcher can run a
database search but cannot then align the sequences found by that search. It is also
difficult to create an alignment of sequences and then edit that alignment.

Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins
Edited by A.D. Baxevanis and B.F.F. Ouellette
ISBN 0-471-19196-5, pages 74-97. Copyright © 1998 Wiley-Liss, Inc.

This chapter introduces and discusses an environment that provides interoperability 
among a large number of sequence analysis and database searching programs as well as 
access to sequence data from a variety of sources. The environment is SeqLab®, developed 
by the Genetics Computer Group (GCG), as part of the Wisconsin Package™. The Wiscon- 
sin Package, a comprehensive set of sequence analysis programs, is distributed with public 
nucleic acid and protein databases. SeqLab is a graphical user interface (GUI) that permits 
full access to Wisconsin Package programs and supported databases. In addition, it provides 
an environment for creating, displaying, editing, and annotating sequences. SeqLab can also 
be expanded to include other publicly and locally available programs and databases. 

Many of the analyses performed by Wisconsin Package programs are discussed in detail 
in other chapters of this volume, as are the databases distributed with the Wisconsin Pack- 
age and SeqLab. Therefore, this chapter emphasizes the environment within which data- 
base entries and local sequences can be accessed, the types of analysis that can be 
performed, and the means of editing and annotating these entries and sequences. 

THE WISCONSIN PACKAGE 

The Wisconsin Package is a comprehensive sequence analysis software package that con- 
sists of over 120 individual programs, each performing a single analytical task. Database 
entries from public and private databases as well as individual sequence files can be ana- 
lyzed with Wisconsin Package programs because there is a uniform format for sequences 
used as input to all programs. In addition, the output files from some programs are in a for- 
mat that permits them to be further analyzed with other programs. Because of this, and the 
modularity of the package as a whole, a user can analyze sequences in a number of differ- 
ent ways by using programs in different combinations. The appendix of this chapter lists 
and describes the most widely used programs. A complete listing and detailed description 
of all programs can be found in the Program Manual for the Wisconsin Package. 

The Wisconsin Package supports a number of UNIX platforms as well as OpenVMS.
General information about GCG, the Wisconsin Package, supported platforms, and hardware
requirements can be found on the GCG home page, www.gcg.com/, and in the Wisconsin
Package User's Guide.

DATABASES THAT ACCOMPANY THE WISCONSIN PACKAGE 

GCG supports and distributes five databases, two nucleic acid and three protein, for use 
with the Wisconsin Package. These databases are in both GCG format (for use with most 
Wisconsin Package programs), and BLAST format (for use with the BLAST database 
searching program). Indices for the LookUp program, for database reference searching, are 
also provided. 

The two supported nucleic acid databases are the GenBank database (Benson et al., 
1997), provided in its entirety, and an abridged version of the EMBL Nucleotide Sequence 
Database (Stoesser et al., 1997), consisting only of sequences not present in GenBank. 
These two databases have been combined for searching purposes into a single, comprehensive
nucleotide database named GenEMBLPlus. This combined database includes the
GenBank and EMBL Nucleotide Sequence Database divisions for expressed sequence tag
(EST), sequence tag site (STS), and genome sequence survey (GSS) entries. It is possible
to search the GenEMBLPlus database without these divisions with the specification GenEMBL.

The three protein databases supported and distributed by GCG are the Protein Information
Resource (PIR) International Protein Sequence Database (George et al., 1997), the
SWISS-PROT Protein Sequence Databank (Bairoch and Apweiler, 1997), and SP-TrEMBL
(Bairoch and Apweiler, 1997). SP-TrEMBL is a joint venture of the European Bioinformatics
Institute and Amos Bairoch of the University of Geneva in Switzerland. It contains
all predicted translated regions noted in EMBL database entries but does not contain
any entries already present in SWISS-PROT. SP-TrEMBL entries are annotated to
SWISS-PROT conventions, and as these entries appear in the SWISS-PROT database they
will be removed from SP-TrEMBL. These two databases, SWISS-PROT and SP-TrEMBL,
have been combined for searching purposes to create a comprehensive protein database.

Updated databases are distributed (following the GenBank database release schedule) as
part of the GCG Database Update Service. Alternatively, Wisconsin Package utility
programs and scripts are available for downloading and formatting database releases on
site. These programs can also be used to update databases between releases or to format
private data into databases for use with the Wisconsin Package. A list and description of
these utility programs can be found in the Wisconsin Package System Support Manual.
Databases in FASTA format can be used directly, without further formatting, with all
programs included in the Wisconsin Package except the BLAST and LookUp programs.



THE SEQLAB ENVIRONMENT 

SeqLab is a graphical user interface to the Wisconsin Package based on OSF/Motif. It
allows access to most Wisconsin Package programs and all supported databases from an X
Windows-based environment. Use of SeqLab requires an X-terminal or X-server software
running on a microcomputer. Recommendations for X-server software can be found on the
GCG home page.

SeqLab is launched from the system prompt with the command seqlab. A window appears
entitled SeqLab Main Window (Figure 4.1). There are two modes in which this main window
can appear: Main List mode and Editor mode (referred to here as the "SeqLab Editor"). In
Main List mode the SeqLab Main Window displays a list file containing the names of
single-sequence, list, multiple-sequence format (MSF), and rich-sequence format (RSF)
files as well as database entries. In Editor mode the SeqLab Main Window displays the
sequences in these files and database entries. Users can toggle between the two modes
with the Mode option button on the SeqLab Main Window (Figure 4.1). Both modes permit
access to Wisconsin Package programs and supported databases, but from the SeqLab Editor
a user can also edit and annotate sequences. This chapter concentrates on the SeqLab
Editor.

Across the top of the SeqLab Main Window is a menu bar; the menu options can be
summarized as follows:

Figure 4.1 The SeqLab Main Window in Editor mode.

File: Options for adding sequences from databases or directory files or for creating 
sequences de novo. 

Edit: Options for moving and editing sequences and performing simple operations. 
Functions: Wisconsin Package programs organized by analysis topics. 
Extensions: A list of additional programs, if any, that can be run from within SeqLab. 
Options: Preferences for displaying sequences and output, file management, and 
printing. 

Windows: A list of windows for output display, program monitoring, and features anno- 
tation. 

Help: Online help for Wisconsin Package programs and the SeqLab interface. 

In addition to the Mode option button, the SeqLab Main Window includes a Display 
option button for changing the color or shading of sequences displayed and a scale bar for 
changing their horizontal scale. A panel of icons offers an alternative method for selecting 
editing options, viewing sequence information, and setting protections. The majority of the 
space in this window, however, is reserved for displaying sequences (Figure 4.1). 

Adding Entries from Databases and Sequence Files from Directories 

A sequence must appear in the SeqLab Main Window before it can be edited or analyzed
with Wisconsin Package programs. Database entries are added either by entry name or by
accession number. To add an entry, select Add Sequences From from the File menu. Next
select Databases from the extended menu that appears. A Database Browser window opens.
Type the entry name or accession number of the desired entry, then click the Add to Main
Window button and the Close button. This procedure can be abbreviated as shown in the
steps that follow. (Similar abbreviations are used throughout this chapter to describe
mouse commands.)

To add an entry from a database to the SeqLab Main Window:

1. Select File; go to Add Sequences From, and click Databases.

2. Type the entry name or accession number in the Database Specification text box of
the Database Browser (Figure 4.2).

3. Click Add to Main Window, then Close.

Users can also add GCG-formatted sequence files to the list displayed in the SeqLab Main
Window.

Figure 4.2 The Database Browser and Add Sequence windows for adding sequences to the SeqLab
Editor.

To add a directory file to the SeqLab Main Window:

1. Select File; go to Add Sequences From, and click Sequence Files.

2. Select the appropriate filter in the Filter text box. (The default filter is *.seq, which
will display all files in a directory ending in .seq. Replace *.seq with * to display all
the files in a directory.)

3. Select the appropriate directory from the Directory area. 

4. Click Filter. 

5. Select the files by name from the Files area of the Add Sequences window. 

6. Click Add, then Close. 
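
The Filter text box described in step 2 behaves like a shell-style filename pattern. As a rough illustration only (this is not how SeqLab itself is implemented, and the file names are invented), the same filtering can be sketched in Python with the standard fnmatch module:

```python
from fnmatch import fnmatchcase


def filter_files(names, pattern="*.seq"):
    """Return the names that match a shell-style pattern, keeping their order."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Hypothetical directory listing, for illustration only.
listing = ["dromrna.seq", "notes.txt", "hsp70.seq", "working.list"]

print(filter_files(listing))       # default filter: only files ending in .seq
print(filter_files(listing, "*"))  # "*" displays every file in the directory
```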

Reference information for database entries and individual sequences can be viewed by 
double-clicking on the name of the entry or sequence. This action opens the Sequence 
Information window. Information in any of the text boxes on this window can be edited, as 
necessary. For example, it is often convenient to rename a database entry or add an 
ID/accession number to a sequence that is part of a large project. 

Users can navigate within and among the sequences displayed in SeqLab with the arrow 
keys and horizontal and vertical scroll bars. Move to a residue within the sequence by typ- 
ing the number of the residue and pressing the return key. Many other shortcuts for navi- 
gating within the SeqLab Editor, including moving relative to the current cursor position, 
are detailed in the SeqLab Guide. 

Creating a New Sequence Entry 

Users can also enter new protein or nucleic acid sequences into SeqLab. 
To enter a new protein or nucleic acid sequence: 

1. Select File and go to New Sequence. 

2. Choose either DNA, RNA, or Protein from the New Sequence box. 

When the listing appears, click at the beginning of the entry and either type in new 
sequence information or paste in sequence information from another window. Add refer- 
ence information by double-clicking on the name of the new entry. This action opens the 
Sequence Information window. All the text boxes are editable, so in addition to renaming 
the entry, a description, author name, or ID/accession number can be included. General ref- 
erence information can be added to the large text box at the bottom of the window. 

Editing Existing Sequences 

It is impossible to accidentally insert or delete residues because existing sequences dis- 
played in the SeqLab Editor are protected. These protections can be changed, however, and 
when they have been removed, residues can be added and deleted, and it is possible to cut 
and paste sequences or regions of sequences between entries. 
To change the protections on a sequence: 

1. Select File and go to Sequence Protections. 

2. Select all the buttons in the Sequence Protections window and click OK.

to the left of the island such that the overall alignment is conserved. A complete list
of editing operations is included in the Wisconsin Package SeqLab Guide.



ANALYZING SEQUENCES WITH OPERATIONS AND 
WISCONSIN PACKAGE PROGRAMS 

Once sequences have been added and displayed in the SeqLab Main Window they can be
analyzed by running any Wisconsin Package program. The output files created by the
programs can be viewed (see section entitled Viewing Output), and some of the files can
be added back to the SeqLab Editor or SeqLab List mode for extended analysis. There are
also a few simple operations that can be run directly from the SeqLab Editor.



Performing Simple Operations 

The Edit menu in SeqLab Editor mode enables users to perform simple operations on
displayed sequences without running programs. These operations include translating
nucleic acid sequences, reversing and complementing nucleic acid sequences, and
calculating consensus sequences. They have the advantage of running rapidly and
displaying results automatically in the SeqLab Editor. The results can be edited,
annotated, and, most importantly, used as input to Wisconsin Package programs selected
from the Functions menu.
To select an operation: 

1. Select a sequence by name or a range of a sequence. 

2. Select Edit and go to the operation of choice. 
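
As an illustration of one of these simple operations, reversing and complementing a nucleic acid sequence can be sketched in a few lines of Python. This is a minimal sketch that assumes unambiguous DNA bases only; the function name is illustrative and is not taken from the Wisconsin Package.

```python
# Translation table for unambiguous DNA bases, in upper and lower case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the order of the bases."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGGCGGATAAAG"))
```

Applying the operation twice returns the original sequence, which makes a convenient sanity check.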



Running Wisconsin Package Programs 

Wisconsin Package programs are listed in the Functions menu, divided by analysis topic.
The Map program, from the Mapping functions topic, is used here as an example.

To run Map, a Wisconsin Package program:

1. Select a sequence by name or a region of a sequence with the cursor.

2. Select Functions and go to Mapping. Then select Map.

Selecting a program by name opens a Program window for that program. Every Program
window has the same basic format, which includes the name of the selected sequence,
the parameters required to run the program, a panel of buttons for selecting and saving 
optional parameters, and buttons for running the program, closing the window, and obtain- 
ing help. The Program window for the program Map is shown on the left in Figure 4.3. 

Users can run a program with the default selections for required parameters or modify 
them with the buttons and text boxes on the Program window. In addition, each program 
has a unique set of optional parameters that will modify the analysis the program performs 
or change the way the output is displayed. These optional parameters are listed on the Pro- 
gram Options window, which is opened by selecting the Options button on the Program 
window. By selecting from required and optional parameters for the Map program, a user 
can select a subset of enzymes to include in a restriction map, opt for including only 
enzymes that produce a 5' overhang on that map, or choose to omit the reverse complement 
strand normally included as part of a restriction map. The Map Options window is shown 
on the right in Figure 4.3. 
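
To make the idea of a restriction map with an enzyme subset concrete, the following Python sketch locates recognition sites for a handful of well-known enzymes and can restrict the result to enzymes that leave a 5' overhang. It is a simplified illustration, not the Map program's algorithm: the enzyme table is a small hand-picked sample, only the top strand is scanned (sufficient here because the listed sites are palindromic), and all names are hypothetical.

```python
# name: (recognition site, number of site bases before the top-strand cut, overhang)
ENZYMES = {
    "EcoRI": ("GAATTC", 1, "5'"),
    "BamHI": ("GGATCC", 1, "5'"),
    "HindIII": ("AAGCTT", 1, "5'"),
    "PstI": ("CTGCAG", 5, "3'"),
    "SmaI": ("CCCGGG", 3, "blunt"),
}


def restriction_map(seq, enzymes=ENZYMES, overhang=None):
    """Map each cutting enzyme to the 1-based positions of the base before each cut.

    If `overhang` is given ("5'", "3'", or "blunt"), only enzymes of that
    type are considered.
    """
    seq = seq.upper()
    result = {}
    for name, (site, cut, kind) in enzymes.items():
        if overhang is not None and kind != overhang:
            continue
        cuts = []
        start = seq.find(site)
        while start != -1:
            cuts.append(start + cut)
            start = seq.find(site, start + 1)  # allow overlapping sites
        if cuts:
            result[name] = cuts
    return result


example = "AAGAATTCTTCTGCAGGG"
print(restriction_map(example))                 # all enzymes in the table
print(restriction_map(example, overhang="5'"))  # 5' overhang enzymes only
```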

Selecting the Run button on a Program window will run that program with the selected 
parameters and close the Program window. If a program is rerun during the same SeqLab 
session, the Program window will appear with all the previously selected parameters in 
place. Selected parameters can be saved between SeqLab sessions by selecting the Save 
Settings button. Selecting GCG Defaults from the Program window will reset the default 
parameter selections on both the Program and Program Options windows. All Program 
windows also include a Help button for accessing online help specific for that program. 



VIEWING OUTPUT 

Output files generated by programs run during a SeqLab session are listed in the Output 
Manager window (Figure 4.4). 




Figure 4.3 Left: Example of a Program window. For the Map program, this window is displayed by
selecting Map from the Functions menu. Right: Example of a Program Options window. For the Map
program, this window is displayed by selecting the Options button on the Map Program window.

Figure 4.4 An output file created by the Map program displayed in an Output Display window
(foreground). All output files created in SeqLab are displayed in the Output Manager window
(background).

To open the Output Manager window:

1. Select Windows and go to Output Manager.

From this window output files can be displayed or printed. Click the Display button to
display a highlighted file. An example of a displayed output file is shown in Figure 4.4.
Click the Print button to send the selected file to a networked printer.

An output file generated in an earlier SeqLab session cannot be viewed or printed unless
it is listed in the Output Manager window. Select the Add Text Files button, or Add
Graphics Files, and select the file by name from the file browser that appears.

Programs that produce graphics output create files with ".figure" extensions. When a file
of this type is selected for display, it is translated for display in an X-window. When a
file of this type is selected for printing, it is translated into either PostScript or
HPGL™, depending on the printer selected.

Some output files can be added to the SeqLab Main List or Editor and used as input to
Wisconsin Package programs. If such a file is selected in the Output Manager window, the
Add to Main List and Add to Editor buttons will be active (Figure 4.4). If the selected
output file cannot be added to these windows, the buttons will be inactive.

MONITORING PROGRAM PROGRESS AND
TROUBLESHOOTING PROBLEMS

Every program run during a SeqLab session is recorded in the Job Manager window (Fig- 
ure 4.5). This window can be accessed from the Windows menu bar on the SeqLab Main 
Window. 

To open the Job Manager window: 

1. Select Windows and go to Job Manager. 

The top half of the Job Manager window is a log of all the programs that have been run dur- 
ing the current SeqLab session. The status of any program can be monitored by selecting 
the program by name. If a program fails to run for any reason, a message will appear in
this window and a log file for that program will appear in the Output Manager window. It 
is also possible to stop a running program from this window. 



ANNOTATING SEQUENCES AND GRAPHICALLY DISPLAYING 
ANNOTATIONS IN THE SEQLAB EDITOR 

A unique feature of SeqLab is its link to the Features table of database entries. For
example, nucleic acid database entries often have features for the locations of coding
regions. Features such as these can be displayed in the SeqLab Editor.

Figure 4.5 The Job Manager window. All programs run during a SeqLab session are listed in this
window.

To select features Display options:

1. Select the Display option button and then Features Coloring.

2. Select the Display option button and then Graphics Features.

Figure 4.6 The SeqLab Editor displaying a multiple-sequence alignment with Graphics Features.

Another unique and extremely useful feature of the SeqLab Editor is the ability to add
features and edit existing ones. This is done from the Sequence Features and Feature
Editor windows (Figure 4.6).

To add a feature: 

1. Highlight a region with the cursor (or add ranges to the text boxes From and To, 
found in the Feature Editor). 

2. Select Windows and then Features. 

3. Select Add in the Sequence Features window. 

4. Select Shape and Color in the Feature Editor window. 

5. Type a name for the feature in the Keyword text box in the Feature Editor window. 

6. Type a detailed comment in the Comments area of the Feature Editor window. 

7. Click OK, then Close. 

To edit a feature: 

1. Select Windows and then Features. 

2. Select the feature to edit in the Sequence Features window. 

3. Select Edit in the Sequence Features window. 

4. Modify Shape, Color, Range, Keyword, or Comments in the Feature Editor window. 

5. Click OK, then Close. 



SAVING SEQUENCES IN THE SEQLAB EDITOR 

When a user exits SeqLab Editor mode, or saves editing work, the information is saved in 
a rich-sequence format (RSF) file. This is a new type of file that includes reference and fea- 
tures information as well as the sequence itself. The format of an RSF file enables features 
information to be displayed in the SeqLab Editor. RSF files can contain one or more 
sequence entries. If database entries are saved, copies of those entries (including all refer- 
ence and features table information) are included in the RSF file. RSF files created in this 
way are automatically added to the current list file displayed in SeqLab List mode and are 
stored in the user's working directory. 



EXAMPLES OF ANALYSES THAT CAN BE UNDERTAKEN 
IN SEQLAB 

Having access to many sequence analysis programs confers the ability to use them sequen- 
tially to answer related questions or to repeat an analysis after the input sequences have 
been edited. The advantage of having access to both public databases and local sequences 
is the ability to use them both in a single analysis without first having to transfer or
reformat them. This section describes six kinds of sequence analysis problems that can be
solved with SeqLab.
Finding Open Reading Frames in Two mRNAs, Translating Them,
and Aligning the RNAs and the Proteins

acid sequences. program by selecting it from 

^ing fnanes in any of .he ^""^■^S^u.nJs displayed 

rather, functionally. 

Finding Related Entries in Databases Through Reference Searching 
and Aligning Them 

A user working with a member of a characterized sequence family may wish to find other 
STJi^SS. - i displayed Ln ft. Ouu». Manager »i»dow and *. added 
within the output file are shown in Figure 4.7.



Figure 4.7 Windows showing a database reference search using the LookUp program, the output file
from this search, and a multiple-sequence alignment of the entries that were found. The middle
window displays the results of this search; the alignment is shown in the lower left.



sequence alignment of the sequences most similar to the query, and generate a phylogram of 
the data. 

Add the query sequence to the SeqLab Editor and select the FASTA program from the 
Functions menu. FASTA (Pearson and Lipman, 1988) searches a database for sequences 
similar to a query sequence. The output file can be displayed from the Output Manager 
window and can be added directly to the SeqLab Editor. The best regions of local similar- 
ity between the database entries and the query sequence are noted in this output file and 
only those regions of each database entry can be displayed in the SeqLab Editor if a dis- 
play is desired. Unwanted entries can be deleted from the SeqLab Editor altogether. 

Select the PileUp program from the Functions menu to create a multiple-sequence 
alignment of these sequences. The output can be displayed from the Output Manager win- 
dow and added to the SeqLab Editor, overwriting the existing, unaligned sequences. This 
alignment can be edited if necessary, and useful features table information from the data- 
base entries can be added to the query sequence. 

Select the PaupSearch program from the Functions menu. This program provides a 
GCG interface with the tree-searching options in PAUP™ (Phylogenetic Analysis Using 
Parsimony) (Swofford, 1996). The PaupDisplay program provides a GCG interface to tree 
manipulation, diagnosis, and display options in PAUP. The output from the FASTA search,
the alignment of the first six sequences, and the evolutionary tree generated from this
alignment are shown in Figure 4.8.

and Searching a Database for Similar Sequences

sequence. Once the contig has been assembled, the user may wish to find open reading
frames within the sequence, translate them, and look for similar sequences in a database. 

The programs of the Fragment Assembly System can be used to assemble overlapping 
sequence fragments. The GelStart program creates a project. The GelEnter program copies 
fragments into the project. The GelMerge program finds overlaps between the fragments 
and assembles them into contigs. The GelAssemble program is an editor for editing these 
contiguous units and resolving conflicts between the fragments. All these programs can be 
selected from the Functions menu. Once assembled, the consensus sequence for the final
contig can be saved as a Sequence file and added to the SeqLab Editor. 
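
The core idea behind merging overlapping fragments can be shown with a deliberately simplified Python sketch. It assumes exact overlaps between the end of one fragment and the start of the next, ignores reverse complements and sequencing errors, and is not the algorithm used by GelMerge; the function names are hypothetical.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_pair(left, right, min_overlap=3):
    """Join two fragments at their overlap, or return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    return left + right[size:] if size else None


fragment_a = "ATGGCGGAT"
fragment_b = "GGATAAAGT"
print(overlap_length(fragment_a, fragment_b))
print(merge_pair(fragment_a, fragment_b))
```

A real assembler must also tolerate mismatches and resolve conflicts between fragments, which is the role of an editor such as GelAssemble in the workflow described above.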

Use the Map, Frames, TestCode (Fickett, 1982), or CodonPreference (Gribskov et al., 
1983) programs to predict coding regions within the sequence. (All these programs can be 
selected from the Functions menu.) Use the Select Range function of the Edit menu to 
select the ranges predicted by these programs and the Translate operation of the Edit menu 
to translate them to protein. These proposed translated regions can also be added as fea- 
tures in the nucleic acid consensus sequence. 
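
The step of locating candidate coding regions and translating them can be sketched as follows. This is a minimal Python illustration that assumes the standard genetic code, scans only the three forward reading frames, and looks simply for ATG-to-stop stretches; programs such as TestCode and CodonPreference use statistical methods that are not reproduced here, and the function names are illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA codon by codon; codons not in the table become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def find_orfs(dna, min_codons=2):
    """Yield (start, end, protein) for ATG...stop stretches in the forward frames.

    `start` is 0-based, `end` is exclusive and includes the stop codon.
    """
    dna = dna.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and CODON_TABLE.get(codon) == "*":
                if (i - start) // 3 >= min_codons:
                    yield start, i + 3, translate(dna[start:i])
                start = None


for orf in find_orfs("CCATGGCGGATAAATAGCC"):
    print(orf)
```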

Select the protein sequence and then select BLAST (Altschul et al., 1990) from the 
Functions menu. BLAST searches databases for entries similar to a query sequence. Both 
remote and local searches are possible. The results can be displayed from the Output Man- 
ager window. If a local database is searched, the resulting file can be added to the SeqLab 
Editor or Main List window, allowing further analysis on the sequences found. 

Aligning Related Protein Sequences, Calculating a Consensus Sequence
for the Alignment, Identifying a Novel Pattern in the Sequences
and Searching a Database for Sequences That Contain That Pattern,
or Searching the Alignment Consensus for Known Protein Patterns

A user who has identified a group of related sequences may wish to align them and calcu- 
late a consensus sequence for the alignment. If a conserved pattern can be found in the 
alignment, the user may wish to search a database for other sequences that contain that pat- 
tern. The user may also wish to search the calculated consensus sequence for known pro- 
tein patterns. 

Select the sequences to align and select the PileUp program from the Functions menu to 
create a multiple sequence alignment. The PileUp output file can be displayed from the 
Output Manager window and added to the SeqLab Editor. It is possible for a user to realign 
a region of the alignment and place that region back into the original alignment. To do this, 
highlight the region and rerun PileUp. Select "realign a portion of an existing alignment"
from the PileUp Options window. It might also be advantageous to select an alternate scor- 
ing matrix or different creation and extension penalties. The new output file will contain the 
original alignment, with the realigned region replacing the original alignment in that 
region. 

Calculate a consensus sequence for the alignment with the Consensus operation in the 
Edit menu. If a conserved pattern can be identified, select the FindPatterns program from 
the Functions menu. Cut the pattern from the consensus sequence, paste it into the Find- 
Patterns Pattern Chooser, and search a database for sequences containing that pattern. 

Alternatively, search the consensus sequence for known protein pattern motifs by run- 
ning the Motifs program. Motifs searches protein sequences for the known protein patterns 
listed in PROSITE, the PROSITE Dictionary of Protein Sites and Patterns (Bairoch et al., 
1997). If a motif is identified, add a feature to all the sequences, noting its position. An 
alignment of protein sequences plus a consensus sequence is shown in Figure 4.9, along 
with the results of a Motifs search. 
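
The consensus step described above can be illustrated with a short Python sketch. It is not the algorithm used by the SeqLab Consensus operation, which the text does not describe; it assumes a simple plurality rule, and the function name, the gap character, the `min_fraction` threshold, and the `x` placeholder are all hypothetical choices made for illustration.

```python
from collections import Counter


def consensus(aligned, gap="-", min_fraction=0.5, unknown="x"):
    """Return a plurality consensus for equal-length aligned sequences.

    A column yields its most common residue when that residue occurs in at
    least min_fraction of the sequences; otherwise it yields `unknown`.
    Columns that contain only gaps yield the gap character.
    """
    if not aligned:
        raise ValueError("no sequences given")
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")

    out = []
    for column in zip(*aligned):
        counts = Counter(c.upper() for c in column if c != gap)
        if not counts:
            out.append(gap)
            continue
        residue, n = counts.most_common(1)[0]
        out.append(residue if n / len(column) >= min_fraction else unknown)
    return "".join(out)


if __name__ == "__main__":
    print(consensus(["MKTAY", "MKSAY", "MRTAY"]))
```

A conserved stretch of the resulting string is the kind of pattern that could then be pasted into a pattern search.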

Using Profiles for Similarity Searches and Aligning Related Sequences 

A new and expanding region of sequence analysis is profile technology. A profile is a 
position-specific scoring matrix that contains information about all the residues at each 
position in a sequence alignment. This is in contrast to a consensus sequence, which 
contains information about only the consensus residue at each position. Once made, a profile 
can be used to search a database, database division, or search set for sequences similar to 
the sequences in the original alignment. It can also be used to align a single sequence to 
the profile. 

Use the ProfileMake program (Gribskov et al., 1987, 1990) to create a profile of a 
sequence alignment, the ProfileSearch program to search a database with the profile, and 
the ProfileSegments program to display the results. Use the ProfileGap program to align 
a sequence to the profile (Gribskov et al., 1987, 1990). ProfileMake, ProfileSearch, 
ProfileSegments, and ProfileGap are all available from the Functions menu. 
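
To make the difference between a profile and a consensus concrete, the following Python sketch builds a toy position-specific scoring matrix and slides it along a sequence. It is a simplified illustration, not the method of Gribskov et al. or the ProfileMake implementation: it assumes an ungapped alignment, a uniform background, log-odds scores with a pseudocount, and an ungapped search. All function and parameter names are hypothetical.

```python
import math
from collections import Counter


def make_profile(aligned, alphabet, pseudocount=1.0):
    """Build one {residue: log-odds score} table per alignment column.

    Assumes an ungapped alignment and a uniform background frequency.
    """
    background = 1.0 / len(alphabet)
    profile = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        profile.append({
            a: math.log2(((counts[a] + pseudocount) / total) / background)
            for a in alphabet
        })
    return profile


def best_window(profile, sequence):
    """Score every ungapped window of the sequence; return (score, offset).

    Returns None when the sequence is shorter than the profile.
    """
    width = len(profile)
    best = None
    for offset in range(len(sequence) - width + 1):
        window = sequence[offset:offset + width]
        score = sum(col[res] for col, res in zip(profile, window))
        if best is None or score > best[0]:
            best = (score, offset)
    return best


if __name__ == "__main__":
    profile = make_profile(["ACGT", "ACGA", "ACGT"], "ACGT")
    print(best_window(profile, "TTACGTTT"))
```

Unlike a consensus string, each column here keeps a score for every residue, so a residue that is rare but present in the alignment is penalized less than one that never occurs.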

EXTENDING SEQLAB BY INCLUDING PROGRAMS THAT ARE 
NOT PART OF THE WISCONSIN PACKAGE 

Another key feature of SeqLab is the flexibility to insert additional programs in the envi- 
ronment. Briefly, the process entails obtaining an appropriate executable file for the pro- 
gram to be included and creating a configuration file that describes the required and 
optional parameters and formats the input and output files. Detailed instructions on how to 
create a configuration file can be found in the Wisconsin Package System Support Manual. 
It is not necessary to link these stand-alone program executables to any procedures in the 
Wisconsin Package. With this option, it is possible to run any program compiled to run 
under the operating system of the computer running the Wisconsin Package from within 
SeqLab and to view its output as easily as if it were part of the Wisconsin Package. 
ClustalW (Higgins et al., 1996) is the example extension program included with version 
9.0 of the Wisconsin Package. Note that it is not a functional program unless the executable 
has been downloaded or built and the config file edited to point to the location of this file. 

Programs added to the SeqLab environment can be selected from the Extensions menu 
of the SeqLab Main Window. 

Wisconsin Package programs are organized into topics based on scientific application. The 
topics listed are present in the SeqLab Functions menu. Most, but not all, of the programs 
accessible through SeqLab are listed, along with a brief description. The GCG home page 
offers up-to-date information and a complete list of Wisconsin Package programs. 

Pairwise Comparison 

Gap: Uses the algorithm of Needleman and Wunsch (1970) to find the optimal global 
alignment of two sequences. 

BestFit: Uses the algorithm of Smith and Waterman (1981) to find the optimal local align- 
ment of two sequences. 

FrameAlign: Creates an optimal local alignment between a protein sequence and the 
codons in the three forward reading frames of a nucleotide sequence, adding gaps as 
necessary to maintain the reading frame. 

Compare/DotPlot: Compares two protein or nucleic acid sequences, creates a file that 
contains information about the regions of similarity between them, and displays these 
results graphically as a dot matrix of similarity. 

ProfileMake/ProfileGap: ProfileMake creates a position-specific scoring table, called a profile, that 
quantitatively represents the information from a group of aligned sequences. ProfileGap 
creates an optimal alignment between a profile and a sequence (Gribskov et al., 1990). 
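
The global and local alignment methods named for Gap and BestFit differ only in a few lines of the dynamic programming recurrence, as the following Python sketch shows. It is a minimal illustration, not the GCG implementation: it assumes a flat match/mismatch score and a single linear gap penalty rather than a scoring matrix with separate gap creation and extension penalties, and it returns only the optimal score, not the alignment itself. The function name and default scores are hypothetical.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Return the optimal alignment score of sequences a and b.

    local=False gives a global alignment score (Needleman and Wunsch).
    local=True gives a local alignment score (Smith and Waterman).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # A global alignment pays for gaps at the ends of either sequence.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            if local:
                # A local alignment may restart anywhere, so scores never
                # drop below zero, and the best cell can be anywhere.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell

    return best if local else score[-1][-1]


if __name__ == "__main__":
    print(alignment_score("ACGT", "ACT"))
    print(alignment_score("TTTACGTTT", "GGACGGG", local=True))
```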

Multiple Comparison 

PileUp: Creates a multiple sequence alignment from a group of sequences using progres- 
sive, pairwise alignments. It also creates a graphic file showing the clustering used to 
create the alignment. 

PlotSimilarity: Graphs the running average of the similarity scores of the sequences in a 
multiple sequence alignment. 

Database Reference Searching 

LookUp: Finds database entries by searching indexed fields such as Name, Accession 
Number, Author, Organism, Keyword, Title, Reference, Feature, Definition, Length or 
Date for descriptive terms (Etzold and Argos, 1993). 

Database Sequence Searching 

BLAST: Searches a database for sequences similar to a query sequence (Altschul et al., 
1990). The query and the database searched can be either peptide or nucleic acid in any 
combination. The program can search databases on an individual user's computer or 
databases maintained at the National Center for Biotechnology Information (NCBI) in 
Bethesda, Maryland. 

FASTA: Searches a database for sequences similar to a query sequence. It was written by 
William Pearson and David Lipman (Pearson and Lipman, 1988). 

TFASTA: Searches a nucleotide database for sequences similar to a protein query 
sequence. It translates the database sequences in all six frames before performing the 
comparison (Pearson and Lipman, 1988). 

FrameSearch: Searches a nucleotide database or list file for sequences similar to a pro- 
tein query. It can also search a protein database or list file for sequences similar to a 
nucleotide query. For each sequence comparison, the program finds an optimal align- 
ment between the protein sequence and all possible codons on each strand of the 
nucleotide sequence, adding gaps to maintain the reading frame. 

ProfileMake/ProfileSearch/ProfileSegments: ProfileMake creates a position-specific 
scoring table, called a profile, that quantitatively represents the information from a 
group of aligned sequences. ProfileSearch uses this profile to search a database, data- 
base division, or list file for sequences similar to those that created the profile. Profile- 
Segments displays the local regions of similarity between the database entries and the 
profile (Gribskov et al., 1990). 

FindPatterns: Identifies sequences containing short patterns. Patterns can be defined 
ambiguously at each position and/or overall mismatching can take place. 
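
The two kinds of tolerance mentioned for FindPatterns, ambiguity at a position and mismatches overall, can be sketched in Python as follows. This is an illustration only and does not reproduce the FindPatterns pattern syntax, which is not given here. It assumes nucleotide patterns written with the standard IUPAC ambiguity codes, scans a single strand, and uses hypothetical names throughout.

```python
# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def find_pattern(sequence, pattern, max_mismatches=0):
    """Return (start, mismatches) for every window that fits the pattern.

    A position matches when the base is one of those allowed by the
    pattern symbol; up to max_mismatches positions may fail to match.
    """
    sequence, pattern = sequence.upper(), pattern.upper()
    hits = []
    for start in range(len(sequence) - len(pattern) + 1):
        window = sequence[start:start + len(pattern)]
        mismatches = sum(
            base not in IUPAC[symbol] for base, symbol in zip(window, pattern)
        )
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


if __name__ == "__main__":
    # R matches A or G, so both GAATTC and GAGTTC are found.
    print(find_pattern("CCGAATTCGGGAGTTC", "GARTTC"))
```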

Editing and Publication 

Pretty: Varies the display of multiple-sequence alignments. It can also calculate a con- 
sensus sequence for the alignment. 

Publish: Varies the display of single or multiple sequences. A menu of options for display, 
translating, and noting identities is provided. 

MapSort/PlasmidMap: MapSort with the Plasmid option creates a file containing the 
locations of restriction enzyme recognition sites. This file can be graphically displayed 
with the PlasmidMap program. Only circular restriction maps are possible. 



Evolution 

Distances/GrowTree: Creates a distance matrix of the pairwise corrected distances within 
a group of aligned sequences, expressed as the number of nucleotide or amino acid 
substitutions per 100 residues, and constructs a phylogram from that matrix. 

PaupSearch: Provides a GCG interface to the tree-searching options in PAUP (Phyloge- 
netic Analysis Using Parsimony) (Swofford, 1996). 

PaupDisplay: Provides a GCG interface to tree manipulation, diagnosis, and display 
options in PAUP (Phylogenetic Analysis Using Parsimony) (Swofford, 1996). 

Diverge: Estimates the number of synonymous and nonsynonymous substitutions per site 
between two nucleic acid sequences that code for proteins. It uses a variant of the 
method published by Li (Li, 1993; Pamilo and Bianchi, 1993). 

Fragment Assembly 

GelStart/GelEnter/GelMerge/GelAssemble: GelStart creates a fragment assembly project 
or initializes an existing one. GelEnter copies or enters fragments into the project. 
GelMerge finds overlaps between the fragments and assembles them into contigs, or 
contiguous regions. GelAssemble is an editor that displays the contigs for the resolution 
of conflicts between the fragments. 

GelView: Displays all the contigs of a project at a given time and the names of all the frag- 
ments contained in each contig. 

Pattern Recognition and Gene Prediction 

TestCode: Uses algorithms developed by Fickett (1982) to predict protein-coding regions 
based on the nonrandomness of the composition of a nucleic acid sequence at every 
third base. 

CodonPreference: Predicts protein coding regions based on codon usage and third position 
GC bias. Codon frequency tables for several organisms are available (Gribskov et 
al., 1984). 

Frames: Graphically displays open reading frames for the six translation frames of a 
nucleic acid sequence based on the position of start and stop codons. 

FindPatterns: Identifies sequences containing short patterns. Patterns can be defined 
ambiguously at each position and/or overall mismatching can take place. 

Motifs: Finds known protein pattern motifs by searching protein sequences for the 
patterns defined in the PROSITE Dictionary of Protein Sites and Patterns (Bairoch et al., 
1997). 

Composition: Determines the composition of nucleic acid or protein sequence(s). For 
nucleotide sequence(s), it also determines dinucleotide and trinucleotide content. 

CodonFrequency: Creates a codon frequency table from coding regions of sequences or 
existing codon usage tables. The output can be used with many Wisconsin Package pro- 
grams including CodonPreference. 
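
As a rough illustration of what a codon frequency table holds, the Python sketch below counts codons in a set of coding regions and reports each as a rate per thousand codons. It does not reproduce the CodonFrequency program or its output file format. It assumes each input string is a coding region that starts in frame, and the function name and the per-thousand convention are choices made for this example.

```python
from collections import Counter


def codon_frequency(coding_sequences):
    """Count in-frame codons and return (counts, rate per thousand codons).

    Each sequence is read in frame 1; a trailing partial codon is ignored.
    """
    counts = Counter()
    for seq in coding_sequences:
        seq = seq.upper()
        usable = len(seq) - len(seq) % 3
        counts.update(seq[i:i + 3] for i in range(0, usable, 3))

    total = sum(counts.values())
    per_thousand = {codon: 1000.0 * n / total for codon, n in counts.items()}
    return counts, per_thousand


if __name__ == "__main__":
    counts, rates = codon_frequency(["ATGGCTGCTTAA", "ATGGCTTGA"])
    for codon in sorted(counts):
        print(codon, counts[codon], round(rates[codon], 1))
```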

Importing/Exporting 

Reformat: Formats sequence files, symbol comparison tables, or enzyme data files for use 
with Wisconsin Package programs. It can also be used to modify the display of 
sequences. 

FromStaden: Converts a sequence file in Staden format (Staden, 1980) to GCG format. If 
multiple sequences are present in the file, individual sequence files will be created. 

FromGenBank: Converts to GCG format a sequence file in GenBank flatfile format 
(Benson et al., 1997). If multiple sequences are present in the file, individual sequence 
files will be created. 

FromPIR: Converts a sequence file in PIR format (George et al., 1997) to GCG format. If 
multiple sequences are present in the file, individual sequence files will be created. 

FromFASTA: Converts a sequence file in FASTA format (Pearson and Lipman, 1988) to 
GCG format. If multiple sequences are present in the file, individual sequence files will 
be created. 

ToPIR: Converts a GCG-formatted sequence file or files to PIR format (George et al., 
1997). 

ToFASTA: Converts a GCG-formatted sequence file or files to FASTA format (Pearson 
and Lipman, 1988). 

ToStaden: Converts a GCG-formatted sequence file or files to Staden format (Staden, 
1980). 



Mapping 

Map: Displays both strands of a nucleic acid sequence with restriction enzyme cut points 
above the sequence and protein translations below. Map can also create a peptide map 
of an amino acid sequence. 

MapPlot: Graphically displays restriction enzyme recognition sites, one enzyme per line. 

MapSort: Predicts the putative size of fragments after digestion of a nucleic acid with one 
or more restriction enzymes. 

PeptideSort: Predicts the peptide fragments from digest of an amino acid sequence. It 
sorts the predicted peptides by weight, position, and relative retention times determined 
by high-performance liquid chromatography (HPLC). It also includes the composition 
of each peptide as well as a summary of the composition of the whole protein. 
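
The fragment size prediction described for MapSort can be sketched in Python for the simplest case. This is an illustration under stated assumptions, not the MapSort algorithm: it handles one enzyme, a linear molecule, a complete digest, and a search of the given strand only, which is sufficient for a palindromic site. The defaults describe EcoRI, which cuts GAATTC after the first G; the function and parameter names are hypothetical.

```python
def fragment_sizes(sequence, site="GAATTC", cut_offset=1):
    """Return fragment lengths, largest first, for a complete digest.

    cut_offset is the number of bases of the recognition site that lie
    to the left of the cut on the strand being scanned.
    """
    sequence = sequence.upper()
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)

    # Fragment lengths are the distances between consecutive cut points,
    # including the two ends of the linear molecule.
    edges = [0] + cuts + [len(sequence)]
    return sorted((b - a for a, b in zip(edges, edges[1:])), reverse=True)


if __name__ == "__main__":
    print(fragment_sizes("AAAAGAATTCTTTTTTGAATTCCC"))
```

A circular molecule would need the first and last fragments joined, which this sketch does not do.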



Primer Selection 

Prime: Selects oligonucleotide primers for polymerase chain reaction (PCR) reactions, 
primer sequencing, and primer extension experiments. PCR is covered by U.S. Patents 
4,683,195 and 4,683,202, owned by Hoffmann-LaRoche. 



Protein Analysis 

CoilScan: Locates coiled-coil segments in protein sequences. 

HTHScan: Scans protein sequences for the presence of helix-turn-helix motifs, indica- 
tive of sequence-specific DNA-binding structures often associated with gene regulation. 

Isoelectric: Predicts and plots a titration curve for a protein sequence. 

ProfileScan: Uses a database of profiles to find motifs in protein query sequences (Grib- 
skov et al., 1990). 

PeptideSort: Predicts the peptide fragments from digest of an amino acid sequence. It 
sorts the predicted peptides by weight, position, and relative HPLC retention times. It 
also includes the composition of each peptide as well as a summary of the composition 
of the whole protein. 

PepPlot: Predicts secondary structure using the method of Chou and Fasman (Chou and 
Fasman, 1978). The predictions are in a series of parallel plots. Plots for hydropathy and 
hydrophobic moment are included. 

PeptideStructure/PlotStructure: Predicts and displays secondary structure, antigenicity, 
flexibility, hydrophobicity, and surface probability for a protein sequence. 



SPScan: Scans protein sequences for the presence of secretory signal peptides (SPs). 

RNA Secondary Structure 

MFold/PlotFold: Predicts and displays optimal and suboptimal secondary structures for 
an RNA molecule using the energy minimization method of Zuker (1989). 

StemLoop: Finds stems, or inverted repeats, within a sequence. The user specifies the 
minimum stem length, minimum and maximum loop sizes, and the minimum number of 
bonds per stem. 

Translation 

Translate: Translates nucleotide sequences into peptide sequences. 

BackTranslate: Translates an amino acid sequence into a nucleotide sequence. The 
output display helps the user to recognize minimally ambiguous regions that may be good 
for constructing synthetic probes. 
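
Both directions of translation can be illustrated with a short Python sketch built on the standard genetic code. It is not the Translate or BackTranslate implementation. It assumes the standard code only, translates a single frame of the given strand, and marks incomplete or ambiguous codons with X. The `ambiguity` helper is a hypothetical stand-in for the idea of a minimally ambiguous region: it reports how many codons encode each residue, so runs of low numbers suggest where a synthetic probe would be least degenerate.

```python
from collections import Counter
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides, frame=0):
    """Translate one forward frame (0, 1, or 2); '*' marks a stop codon."""
    seq = nucleotides.upper().replace("U", "T")
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(frame, len(seq) - 2, 3)
    )


def ambiguity(peptide):
    """Return, for each residue, the number of codons that encode it."""
    degeneracy = Counter(CODON_TABLE.values())
    return [degeneracy[aa] for aa in peptide.upper()]


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))
    print(ambiguity("MWL"))
```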



